AITopics | Amran Governorate

Collaborating Authors

Amran Governorate

Foundations of Symbolic Languages for Model Interpretability Marcelo Arenas 1,4, Daniel Baez

Neural Information Processing SystemsAug-14-2025, 19:38:43 GMT

Several queries and scores have been proposed to explain individual predictions made by ML models. Examples include queries based on "anchors", which are parts of an instance that are sufficient to justify its classification, and "feature-perturbation" scores such as SHAP .

decision tree, foil, query, (16 more...)

Neural Information Processing Systems

Country:

South America > Chile (0.04)
North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
(2 more...)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.47)
Information Technology > Artificial Intelligence > Representation & Reasoning > Logic & Formal Reasoning (0.46)

Add feedback

Theoretical Investigation on Inductive Bias of Isolation Forest

Zheng, Qin-Cheng, Zhang, Shao-Qun, Lyu, Shen-Huan, Jiang, Yuan, Zhou, Zhi-Hua

arXiv.org Machine LearningMay-20-2025

Isolation Forest (iForest) stands out as a widely-used unsupervised anomaly detector valued for its exceptional runtime efficiency and performance on large-scale tasks. Despite its widespread adoption, a theoretical foundation explaining iForest's success remains unclear. This paper theoretically investigates the conditions and extent of iForest's effectiveness by analyzing its inductive bias through the formulation of depth functions and growth processes. Since directly analyzing the depth function proves intractable due to iForest's random splitting mechanism, we model the growth process of iForest as a random walk, enabling us to derive the expected depth function using transition probabilities. Our case studies reveal key inductive biases: iForest exhibits lower sensitivity to central anomalies while demonstrating greater parameter adaptability compared to $k$-Nearest Neighbor anomaly detectors. Our study provides theoretical understanding of the effectiveness of iForest and establishes a foundation for further theoretical exploration.

anomaly, data mining, machine learning, (20 more...)

arXiv.org Machine Learning

2505.12825

Country:

Asia > China > Jiangsu Province > Nanjing (0.04)
Asia > Middle East > Yemen > Amran Governorate > Amran (0.04)

Genre: Research Report (1.00)

Industry: Information Technology (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Data Science > Data Mining > Anomaly Detection (0.90)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.70)

Add feedback

Concorde: Fast and Accurate CPU Performance Modeling with Compositional Analytical-ML Fusion

Nasr-Esfahany, Arash, Alizadeh, Mohammad, Lee, Victor, Alam, Hanna, Coon, Brett W., Culler, David, Dadu, Vidushi, Dixon, Martin, Levy, Henry M., Pandey, Santosh, Ranganathan, Parthasarathy, Yazdanbakhsh, Amir

arXiv.org Artificial IntelligenceMar-29-2025

Cycle-level simulators such as gem5 are widely used in microarchitecture design, but they are prohibitively slow for large-scale design space explorations. We present Concorde, a new methodology for learning fast and accurate performance models of microarchitectures. Unlike existing simulators and learning approaches that emulate each instruction, Concorde predicts the behavior of a program based on compact performance distributions that capture the impact of different microarchitectural components. It derives these performance distributions using simple analytical models that estimate bounds on performance induced by each microarchitectural component, providing a simple yet rich representation of a program's performance characteristics across a large space of microarchitectural parameters. Experiments show that Concorde is more than five orders of magnitude faster than a reference cycle-level simulator, with about 2% average Cycles-Per-Instruction (CPI) prediction error across a range of SPEC, open-source, and proprietary benchmarks. This enables rapid design-space exploration and performance sensitivity analyses that are currently infeasible, e.g., in about an hour, we conducted a first-of-its-kind fine-grained performance attribution to different microarchitectural components across a diverse set of programs, requiring nearly 150 million CPI evaluations.

artificial intelligence, machine learning, modeling & simulation, (15 more...)

arXiv.org Artificial Intelligence

2503.23076

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
Asia > Middle East > Yemen > Amran Governorate > Amran (0.04)

Genre: Research Report (1.00)

Industry: Energy (0.93)

Technology:

Information Technology > Modeling & Simulation (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Firm or Fickle? Evaluating Large Language Models Consistency in Sequential Interactions

Li, Yubo, Miao, Yidi, Ding, Xueying, Krishnan, Ramayya, Padman, Rema

arXiv.org Artificial IntelligenceMar-28-2025

Large Language Models (LLMs) have shown remarkable capabilities across various tasks, but their deployment in high-stake domains requires consistent performance across multiple interaction rounds. This paper introduces a comprehensive framework for evaluating and improving LLM response consistency, making three key contributions. First, we propose a novel Position-Weighted Consistency (PWC) score that captures both the importance of early-stage stability and recovery patterns in multi-turn interactions. Second, we present a carefully curated benchmark dataset spanning diverse domains and difficulty levels, specifically designed to evaluate LLM consistency under various challenging follow-up scenarios. Third, we introduce Confidence-Aware Response Generation (CARG), a framework that significantly improves response stability by incorporating model confidence signals into the generation process. Empirical results demonstrate that CARG significantly improves response stability without sacrificing accuracy, underscoring its potential for reliable LLM deployment in critical applications.

arxiv preprint arxiv, large language model, machine learning, (18 more...)

arXiv.org Artificial Intelligence

2503.22353

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
Asia > Middle East > Yemen > Amran Governorate > Amran (0.04)

Genre: Research Report > New Finding (1.00)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

LREF: A Novel LLM-based Relevance Framework for E-commerce

Tang, Tian, Tian, Zhixing, Zhu, Zhenyu, Wang, Chenyang, Hu, Haiqing, Tang, Guoyu, Liu, Lin, Xu, Sulong

arXiv.org Artificial IntelligenceMar-12-2025

Query and product relevance prediction is a critical component for ensuring a smooth user experience in e-commerce search. Traditional studies mainly focus on BERT-based models to assess the semantic relevance between queries and products. However, the discriminative paradigm and limited knowledge capacity of these approaches restrict their ability to comprehend the relevance between queries and products fully. With the rapid advancement of Large Language Models (LLMs), recent research has begun to explore their application to industrial search systems, as LLMs provide extensive world knowledge and flexible optimization for reasoning processes. Nonetheless, directly leveraging LLMs for relevance prediction tasks introduces new challenges, including a high demand for data quality, the necessity for meticulous optimization of reasoning processes, and an optimistic bias that can result in over-recall. To overcome the above problems, this paper proposes a novel framework called the LLM-based RElevance Framework (LREF) aimed at enhancing e-commerce search relevance. The framework comprises three main stages: supervised fine-tuning (SFT) with Data Selection, Multiple Chain of Thought (Multi-CoT) tuning, and Direct Preference Optimization (DPO) for de-biasing. We evaluate the performance of the framework through a series of offline experiments on large-scale real-world datasets, as well as online A/B testing. The results indicate significant improvements in both offline and online metrics. Ultimately, the model was deployed in a well-known e-commerce application, yielding substantial commercial benefits.

llm, preprint arxiv, relevance, (14 more...)

arXiv.org Artificial Intelligence

doi: 10.1145/3701716.3715246

2503.09223

Country:

Oceania > Australia > New South Wales > Sydney (0.05)
Asia > China > Beijing > Beijing (0.05)
North America > United States > New York > New York County > New York City (0.04)
Asia > Middle East > Yemen > Amran Governorate > Amran (0.04)

Genre: Research Report (0.50)

Industry: Information Technology > Services > e-Commerce Services (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

DAFE: LLM-Based Evaluation Through Dynamic Arbitration for Free-Form Question-Answering

Badshah, Sher, Sajjad, Hassan

arXiv.org Artificial IntelligenceMar-11-2025

Evaluating Large Language Models (LLMs) free-form generated responses remains a challenge due to their diverse and open-ended nature. Traditional supervised signal-based automatic metrics fail to capture semantic equivalence or handle the variability of open-ended responses, while human evaluation, though reliable, is resource-intensive. Leveraging LLMs as evaluators offers a promising alternative due to their strong language understanding and instruction-following capabilities. Taking advantage of these capabilities, we propose the Dynamic Arbitration Framework for Evaluation (DAFE), which employs two primary LLM-as-judges and engages a third arbitrator only in cases of disagreements. This selective arbitration prioritizes evaluation reliability while reducing unnecessary computational demands compared to conventional majority voting. DAFE utilizes task-specific reference answers with dynamic arbitration to enhance judgment accuracy, resulting in significant improvements in evaluation metrics such as Macro F1 and Cohen's Kappa. Through experiments, including a comprehensive human evaluation, we demonstrate DAFE's ability to provide consistent, scalable, and resource-efficient assessments, establishing it as a robust framework for evaluating free-form model outputs.

evaluation, reference answer, wang, (16 more...)

arXiv.org Artificial Intelligence

2503.08542

Country:

Europe > Germany (0.28)
North America > United States > Florida > Miami-Dade County > Miami (0.04)
Asia > Middle East > Jordan (0.04)
(12 more...)

Genre: Research Report > New Finding (1.00)

Industry:

Law > Alternative Dispute Resolution (1.00)
Health & Medicine (1.00)
Government (0.92)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.98)

Add feedback

ZeroSumEval: An Extensible Framework For Scaling LLM Evaluation with Inter-Model Competition

Alyahya, Hisham A., Khan, Haidar, Alnumay, Yazeed, Bari, M Saiful, Yener, Bülent

arXiv.org Artificial IntelligenceMar-10-2025

We introduce ZeroSumEval, a dynamic, competition-based, and evolving evaluation framework for Large Language Models (LLMs) that leverages competitive games. ZeroSumEval encompasses a diverse suite of games, including security challenges (Capture the Flag), classic board games (chess), and knowledge tests (MathQuiz). These games are designed to evaluate a range of capabilities such as strategic reasoning, planning, knowledge application, safety, and adaptability. Building upon recent studies that highlight the effectiveness of game-based evaluations for LLMs, ZeroSumEval enhances these approaches by providing a standardized and extensible framework for easily implementing games and leverages DSPy to provide a better abstraction for LLM player strategies.

arxiv, wang, zhang, (15 more...)

arXiv.org Artificial Intelligence

2503.10673

Country:

Europe > Ukraine > Kyiv Oblast > Kyiv (0.04)
North America > United States > Texas (0.04)
North America > Canada > Newfoundland and Labrador > Labrador (0.04)
(4 more...)

Genre: Research Report (0.84)

Industry: Leisure & Entertainment > Games > Chess (0.49)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.47)

Add feedback

What do Large Language Models Say About Animals? Investigating Risks of Animal Harm in Generated Text

Kanepajs, Arturs, Basu, Aditi, Ghose, Sankalpa, Li, Constance, Mehta, Akshat, Mehta, Ronak, Tucker-Davis, Samuel David, Zhou, Eric, Fischer, Bob

arXiv.org Artificial IntelligenceMar-9-2025

As machine learning systems become increasingly embedded in human society, their impact on the natural world continues to escalate. Technical evaluations have addressed a variety of potential harms from large language models (LLMs) towards humans and the environment, but there is little empirical work regarding harms towards nonhuman animals. Following the growing recognition of animal protection in regulatory and ethical AI frameworks, we present the Animal Harm Assessment (AHA), a novel evaluation of risks of animal harm in LLM-generated text. Our dataset comprises 1,850 curated questions from Reddit post titles and 2,500 synthetic questions based on 50 animal categories (e.g., cats, reptiles) and 50 ethical scenarios, with further 70-30 public-private split. Scenarios include open-ended questions about how to treat animals, practical scenarios with potential animal harm, and willingness-to-pay measures for the prevention of animal harm. Using the LLM-as-a-judge framework, answers are evaluated for their potential to increase or decrease harm, and evaluations are debiased for the tendency to judge their own outputs more favorably. We show that AHA produces meaningful evaluation results when applied to frontier LLMs, revealing significant differences between models, animal categories, scenarios, and subreddits. We conclude with future directions for technical research and the challenges of building evaluations on complex social and moral topics.

animal harm, category, claude-3, (15 more...)

arXiv.org Artificial Intelligence

2503.04804

Country:

North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
North America > United States > New York > New York County > New York City (0.04)
South America > Brazil > Rio de Janeiro > Rio de Janeiro (0.04)
(17 more...)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry:

Law (1.00)
Health & Medicine (1.00)
Government (1.00)
(5 more...)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

A Cooperative Multi-Agent Framework for Zero-Shot Named Entity Recognition

Wang, Zihan, Zhao, Ziqi, Lyu, Yougang, Chen, Zhumin, de Rijke, Maarten, Ren, Zhaochun

arXiv.org Artificial IntelligenceFeb-25-2025

Zero-shot named entity recognition (NER) aims to develop entity recognition systems from unannotated text corpora. This task presents substantial challenges due to minimal human intervention. Recent work has adapted large language models (LLMs) for zero-shot NER by crafting specialized prompt templates. It advances model self-learning abilities by incorporating self-annotated demonstrations. However, two important challenges persist: (i) Correlations between contexts surrounding entities are overlooked, leading to wrong type predictions or entity omissions. (ii) The indiscriminate use of task demonstrations, retrieved through shallow similarity-based strategies, severely misleads LLMs during inference. In this paper, we introduce the cooperative multi-agent system (CMAS), a novel framework for zero-shot NER that uses the collective intelligence of multiple agents to address the challenges outlined above. CMAS has four main agents: (i) a self-annotator, (ii) a type-related feature (TRF) extractor, (iii) a demonstration discriminator, and (iv) an overall predictor. To explicitly capture correlations between contexts surrounding entities, CMAS reformulates NER into two subtasks: recognizing named entities and identifying entity type-related features within the target sentence. To enable controllable utilization of demonstrations, a demonstration discriminator is established to incorporate the self-reflection mechanism, automatically evaluating helpfulness scores for the target sentence. Experimental results show that CMAS significantly improves zero-shot NER performance across six benchmarks, including both domain-specific and general-domain scenarios. Furthermore, CMAS demonstrates its effectiveness in few-shot settings and with various LLM backbones.

dataset, demonstration, demonstration discriminator, (15 more...)

arXiv.org Artificial Intelligence

2502.18702

Country:

Oceania > Australia > New South Wales > Sydney (0.05)
Europe > Netherlands > North Holland > Amsterdam (0.05)
Europe > Netherlands > South Holland > Leiden (0.04)
(4 more...)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.95)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents > Agent Societies (0.61)

Add feedback

Filters

Collaborating Authors

Amran Governorate

60cb558c40e4f18479664069d9642d5a-Paper.pdf

Foundations of Symbolic Languages for Model Interpretability Marcelo Arenas 1,4, Daniel Baez

Theoretical Investigation on Inductive Bias of Isolation Forest

Concorde: Fast and Accurate CPU Performance Modeling with Compositional Analytical-ML Fusion

Firm or Fickle? Evaluating Large Language Models Consistency in Sequential Interactions

LREF: A Novel LLM-based Relevance Framework for E-commerce

DAFE: LLM-Based Evaluation Through Dynamic Arbitration for Free-Form Question-Answering

ZeroSumEval: An Extensible Framework For Scaling LLM Evaluation with Inter-Model Competition

What do Large Language Models Say About Animals? Investigating Risks of Animal Harm in Generated Text

A Cooperative Multi-Agent Framework for Zero-Shot Named Entity Recognition